Week 4 of 12 · Part A — Applied Safety

Evaluation vs Red-Teaming

Two different jobs — one finds new failures, the other measures known ones repeatably

Day 16 ~60 minutes Concept

Day 16 of 60

The week's pivot: from finding to measuring

Week 3 you red-teamed — you went hunting for failures nobody had catalogued yet. This week you switch jobs. Evaluation takes the failures you (and others) already know about and measures them repeatably: same prompts, same scoring, run again next week and next model, so you can answer the only question a release decision actually turns on — is this model getting safer or worse?

The thesis

Red-teaming is discovery; evaluation is measurement. A red-team is creative, open-ended, and never the same twice — that's the point. An eval is frozen, scored, and re-runnable — that's its point. Confusing the two gives you a red-team you can't trust as a trend line, or an eval that never finds anything new.

Today is conceptual: you install the distinction cleanly, because every later day this week (the scorecard, the eval set, the harness) is downstream of getting it right. A safety eval that's flattering-but-broken is worse than none — it ships a dangerous model with a green checkmark.

How they differ — three axes

Core Theory

1 · Goal — novelty vs repeatability

Red-teaming maximizes coverage of the unknown: surprise me, find the failure I didn't anticipate. Evaluation maximizes repeatability: give me the same number under the same conditions so I can compare across time and models. You can't optimize both in one activity.

2 · Output — incidents vs metrics

A red-team produces incidents — concrete examples of a model doing something bad, often one-offs. An eval produces metrics — rates over a fixed set (safe-refusal rate, harmful-compliance rate). Incidents become evals when you freeze them into a scored test.

3 · Failure mode — staleness vs blindness

Evals go stale: once a test set is known, models can be tuned to it (or it leaks into training — contamination). Red-teams go blind: a tired team stops being creative and re-finds the same things. Each covers the other's weakness, which is why mature safety programs run both.

The loop between them

They feed each other. A red-team finds a new jailbreak → you turn it into eval cases → the eval tracks whether the next model still falls for it → when the eval saturates (everyone passes), you red-team again for the next unknown. Discovery refills the measurement set; measurement tells you when to go discover more.

What a dangerous-capability eval targets

Not all evals measure the same thing. The highest-stakes ones — dangerous-capability evaluations — don't ask "did the model say something rude?" They ask whether a model has crossed a threshold of capability that would matter for catastrophic risk: persuasion, cyber-offense, self-proliferation. DeepMind's Evaluating Frontier Models for Dangerous Capabilities is the reference for how a lab designs these, and it shows the rigor a real eval demands: clear capability definitions, graded difficulty, and honest reporting of uncertainty.

Why this matters for the rest of the week

The eval you build (Day 18) is a small, behavioral one — refusals, not self-proliferation — but it inherits the same discipline: define what you're measuring precisely, fix the conditions, and report a number you'd defend in a review. Scale changes; the craft doesn't.

The contamination trap

The single most common way an eval lies: contamination. If your test prompts (or close paraphrases) appeared in the model's training data, a high score measures memorization, not safety. The same trap catches public benchmarks the moment they're popular enough to be scraped. Before you trust any benchmark number, ask: could the model have seen this set? Anthropic's Challenges in Evaluating AI Systems is an honest tour of this and the other ways evals quietly mislead.

Your work today

Read + Draw the Line

~60 minutes

Read §1–3 of Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024). Note what makes a capability eval repeatable — what's frozen and what's scored.
Read Challenges in Evaluating AI Systems for the failure modes — contamination, multiple-choice artifacts, judge bias — and write down which one you find scariest and why.
In a notebook, write a two-column table: pick one safety concern (say, a jailbreak you imagined in Week 3) and describe how you'd red-team it vs how you'd evaluate it. Make the difference concrete.
Critique one benchmark honestly: name something its number does not tell you.

The expert move

A beginner uses "red-team" and "eval" interchangeably. An expert keeps them surgically separate, because they answer different questions and fail in different ways — and an expert knows the real value is the loop between them: discovery feeds measurement, and saturated measurement signals it's time to discover again. Owning that loop is owning whether a release decision rests on a number you can defend.

Say this in an interview: "Red-teaming is discovery, evaluation is measurement — I don't conflate them. Red-teams find the unknown failures; evals freeze the known ones into a repeatable score so I can tell whether a model is improving. And I treat contamination as the first thing to rule out, because a benchmark a model trained on measures memory, not safety."

Today's Takeaways

Red-teaming is discovery (novel failures, one-offs); evaluation is measurement (frozen set, repeatable score).
They form a loop: red-team finds it, eval tracks it, saturation sends you back to red-teaming.
Dangerous-capability evals measure thresholds that matter for catastrophic risk, with the same craft as small behavioral ones.
Contamination is the first lie to rule out — a score on data the model trained on measures memory, not safety.